Introduction to the project and data
The dataset used in the project is based on data collected from the 1990 California Census and was initially featured in the following paper:
Pace, R. Kelley, and Ronald Barry. “Sparse spatial autoregressions.” Statistics & Probability Letters 33.3 (1997): 291-297.
Each observation in the dataset describes one block (a small neighborhood-level area) in California in 1990. Although the results may not apply directly to present-day California housing prices, we will implement machine learning techniques to predict the median_house_value in California during that time period.
Variables in the housing dataset:
- longitude: a geographic coordinate that specifies the east-west position of a point on the Earth's surface
- latitude: a geographic coordinate that specifies the north-south position of a point on the Earth's surface
- housing_median_age: median age of a house within a block; lower values indicate newer homes
- total_rooms: total number of rooms within a block
- total_bedrooms: total number of bedrooms within a block
- population: total number of people residing within a block
- households: total number of households (groups of people residing within a home unit) within a block
- median_income: median income for households within a block (units of tens of thousands of US dollars \(\rightarrow\) US dollars after cleaning)
- median_house_value: median house value for households within a block (units in US dollars)
- ocean_proximity: indicator of where the block is located geographically within California
Data import and cleaning
# importing the data
housing <- read_csv(file="housing.csv", col_names=TRUE);
# number of observations in data frame
nrow1 <- nrow(housing);
# remove missing values
housing <- na.omit(housing);
# number of observations in data frame after removing NAs
nrow2 <- nrow(housing);
# make ocean_proximity lower case
housing$ocean_proximity <- tolower(housing$ocean_proximity);
# changing units of tens of thousand dollars to dollars
housing$median_income <- 10000 * housing$median_income;
kbl(head(housing, 25), caption="California housing dataset (first 25 observations)", align="c") %>%
kable_paper() %>%
  scroll_box(width="100%", height="500px");

| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| -122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 83252 | 452600 | near bay |
| -122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 83014 | 358500 | near bay |
| -122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 72574 | 352100 | near bay |
| -122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 56431 | 341300 | near bay |
| -122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 38462 | 342200 | near bay |
| -122.25 | 37.85 | 52 | 919 | 213 | 413 | 193 | 40368 | 269700 | near bay |
| -122.25 | 37.84 | 52 | 2535 | 489 | 1094 | 514 | 36591 | 299200 | near bay |
| -122.25 | 37.84 | 52 | 3104 | 687 | 1157 | 647 | 31200 | 241400 | near bay |
| -122.26 | 37.84 | 42 | 2555 | 665 | 1206 | 595 | 20804 | 226700 | near bay |
| -122.25 | 37.84 | 52 | 3549 | 707 | 1551 | 714 | 36912 | 261100 | near bay |
| -122.26 | 37.85 | 52 | 2202 | 434 | 910 | 402 | 32031 | 281500 | near bay |
| -122.26 | 37.85 | 52 | 3503 | 752 | 1504 | 734 | 32705 | 241800 | near bay |
| -122.26 | 37.85 | 52 | 2491 | 474 | 1098 | 468 | 30750 | 213500 | near bay |
| -122.26 | 37.84 | 52 | 696 | 191 | 345 | 174 | 26736 | 191300 | near bay |
| -122.26 | 37.85 | 52 | 2643 | 626 | 1212 | 620 | 19167 | 159200 | near bay |
| -122.26 | 37.85 | 50 | 1120 | 283 | 697 | 264 | 21250 | 140000 | near bay |
| -122.27 | 37.85 | 52 | 1966 | 347 | 793 | 331 | 27750 | 152500 | near bay |
| -122.27 | 37.85 | 52 | 1228 | 293 | 648 | 303 | 21202 | 155500 | near bay |
| -122.26 | 37.84 | 50 | 2239 | 455 | 990 | 419 | 19911 | 158700 | near bay |
| -122.27 | 37.84 | 52 | 1503 | 298 | 690 | 275 | 26033 | 162900 | near bay |
| -122.27 | 37.85 | 40 | 751 | 184 | 409 | 166 | 13578 | 147500 | near bay |
| -122.27 | 37.85 | 42 | 1639 | 367 | 929 | 366 | 17135 | 159800 | near bay |
| -122.27 | 37.84 | 52 | 2436 | 541 | 1015 | 478 | 17250 | 113900 | near bay |
| -122.27 | 37.84 | 52 | 1688 | 337 | 853 | 325 | 21806 | 99700 | near bay |
| -122.27 | 37.84 | 52 | 2224 | 437 | 1006 | 422 | 26000 | 132600 | near bay |
207 observations were removed because there were missing values in the total_bedrooms column. The dataset now contains 20433 observations.
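To see what the na.omit() step does, here is a toy sketch with hypothetical values (in the real dataset, all missing values are in the total_bedrooms column):

```r
# Toy data frame mimicking the structure of the cleaning step above
toy <- data.frame(
  total_rooms    = c(880, 7099, 1467),
  total_bedrooms = c(129, NA, 190)
)
colSums(is.na(toy))          # missing values per column: 0 and 1
toy_clean <- na.omit(toy)    # drops every row containing an NA
nrow(toy) - nrow(toy_clean)  # 1 observation removed
```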
Exploratory, descriptive statistics and graphics
# geographic data on CA
ca <- map_data("state") %>%
subset(region == "california");
# plotting points from CA housing dataset
ggplotly(
ggplot() +
geom_polygon(data=ca, aes(x = long, y = lat), fill = "light grey", color = "black") +
geom_point(data=housing, aes(longitude, latitude, col=median_house_value, label=population, label2=ocean_proximity,
label3=median_income), size=0.75) +
theme_bw() +
    labs(x="Longitude", y="Latitude", title="California blocks geographic location"));

From the visualization above, we see that many blocks in the Bay Area and along the Southern California coast have a higher median_house_value than blocks in the rest of the state.
# data frame with total counts of each ocean_proximity observation
ocean_dist <- housing %>%
select(ocean_proximity) %>%
group_by(ocean_proximity) %>%
summarise(total=length(ocean_proximity));
# order columns based on counts
ocean_dist$ocean_proximity <- reorder(ocean_dist$ocean_proximity, -ocean_dist$total)
# bar chart
ggplotly(
ggplot(ocean_dist, aes(ocean_proximity, total)) +
geom_bar(stat='identity', fill="dodgerblue") +
labs(x="Ocean proximity", y="Total block groups",
           title="Barchart of total block groups by ocean proximity in California"));

The bar chart above depicts the distribution of the total number of blocks in California by ocean_proximity. The majority of CA blocks are either less than one hour from the ocean (9034 blocks) or inland (6496 blocks). The fewest blocks are on an island, with only 5 blocks.
# box plot median house value vs ocean proximity
options(scipen=10000);
ggplotly(
ggplot(housing) +
geom_boxplot(aes(ocean_proximity, median_house_value), fill="dodgerblue") +
labs(x="Ocean proximity", y="Median house value",
           title="Boxplots of median house value by ocean proximity in California"));

The boxplots above depict median_house_value by ocean_proximity. Blocks less than one hour from the ocean, near a bay, or near the ocean have a median median_house_value of around $220,000. Island blocks have the largest median median_house_value at almost $415,000, while inland blocks have the lowest at just under $109,000. The <1h ocean, near bay, and near ocean groups have similar ranges of median_house_value. Most island blocks have a high median_house_value compared to the other regions, and most inland blocks are inexpensive by comparison, although some inland outliers rival the prices in the other regions.
# summary statistics
housing.mean <- sapply(housing[ , 3:9], mean);
housing.min <- sapply(housing[ , 3:9], min);
housing.max <- sapply(housing[ , 3:9], max);
housing.sd <- sapply(housing[ , 3:9], sd);
housing.median <- sapply(housing[ , 3:9], median);
housing.summary <- rbind(average=housing.mean, minimum=housing.min, maximum=housing.max, "standard deviation"=housing.sd, median=housing.median)
housing.summary <- round(housing.summary, 2);
kable(housing.summary, caption="Summary statistics for quantitative variables", align="c");

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|
| average | 28.63 | 2636.50 | 537.87 | 1424.95 | 499.43 | 38711.62 | 206864.4 |
| minimum | 1.00 | 2.00 | 1.00 | 3.00 | 1.00 | 4999.00 | 14999.0 |
| maximum | 52.00 | 39320.00 | 6445.00 | 35682.00 | 6082.00 | 150001.00 | 500001.0 |
| standard deviation | 12.59 | 2185.27 | 421.39 | 1133.21 | 382.30 | 18992.91 | 115435.7 |
| median | 29.00 | 2127.00 | 435.00 | 1166.00 | 409.00 | 35365.00 | 179700.0 |
# average for each variable by ocean_proximity
housing.mean.OP <- housing[ , 3:10] %>%
group_by(ocean_proximity) %>%
summarise_all(mean, na.rm=T);
# minimum for each variable by ocean_proximity
housing.min.OP <- housing[ , 3:10] %>%
group_by(ocean_proximity) %>%
summarise_all(min, na.rm=T);
# maximum for each variable by ocean_proximity
housing.max.OP <- housing[ , 3:10] %>%
group_by(ocean_proximity) %>%
summarise_all(max, na.rm=T);
# standard deviation for each variable by ocean_proximity
housing.sd.OP <- housing[ , 3:10] %>%
group_by(ocean_proximity) %>%
summarise_all(sd, na.rm=T);
# median for each variable by ocean_proximity
housing.median.OP <- housing[ , 3:10] %>%
group_by(ocean_proximity) %>%
summarise_all(median, na.rm=T);
housing.summary.op <- rbind(housing.mean.OP, housing.min.OP, housing.max.OP,
housing.sd.OP, housing.median.OP);
# column that indicates the summary statistic applied
housing.summary.op$summary_stat <- c(rep("mean", 5), rep("min", 5), rep("max", 5),
rep("sd", 5), rep("median", 5))
housing.summary.op <- housing.summary.op %>%
relocate(summary_stat, .after=ocean_proximity) %>%
arrange(ocean_proximity);
housing.summary.op[ , 3:9] <- round(housing.summary.op[ , 3:9], 2);
kable(housing.summary.op, align="c", caption="Summary statistics for variables by ocean proximity");

| ocean_proximity | summary_stat | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|
| <1h ocean | mean | 29.28 | 2627.23 | 546.54 | 1518.44 | 517.42 | 42311.01 | 240267.99 |
| <1h ocean | min | 2.00 | 11.00 | 5.00 | 3.00 | 4.00 | 4999.00 | 17500.00 |
| <1h ocean | max | 52.00 | 37937.00 | 6445.00 | 35682.00 | 6082.00 | 150001.00 | 500001.00 |
| <1h ocean | sd | 11.64 | 2164.15 | 427.91 | 1186.98 | 392.78 | 19983.00 | 106198.32 |
| <1h ocean | median | 30.00 | 2107.00 | 438.00 | 1246.00 | 420.00 | 38790.00 | 215000.00 |
| inland | mean | 24.26 | 2721.25 | 533.88 | 1392.41 | 478.01 | 32103.59 | 124896.86 |
| inland | min | 1.00 | 2.00 | 2.00 | 5.00 | 2.00 | 4999.00 | 14999.00 |
| inland | max | 52.00 | 39320.00 | 6210.00 | 16305.00 | 5358.00 | 150001.00 | 500001.00 |
| inland | sd | 12.03 | 2390.70 | 446.12 | 1171.24 | 393.07 | 14371.15 | 70057.96 |
| inland | median | 23.00 | 2136.00 | 423.00 | 1124.50 | 385.00 | 29898.00 | 108700.00 |
| island | mean | 42.40 | 1574.60 | 420.40 | 668.00 | 276.60 | 27444.20 | 380440.00 |
| island | min | 27.00 | 716.00 | 214.00 | 341.00 | 160.00 | 21579.00 | 287500.00 |
| island | max | 52.00 | 2359.00 | 591.00 | 1100.00 | 431.00 | 33906.00 | 450000.00 |
| island | sd | 13.16 | 707.55 | 169.32 | 301.69 | 113.20 | 4441.80 | 80559.56 |
| island | median | 52.00 | 1675.00 | 512.00 | 733.00 | 288.00 | 27361.00 | 414700.00 |
| near bay | mean | 37.76 | 2490.34 | 514.18 | 1227.88 | 487.24 | 41756.47 | 259279.29 |
| near bay | min | 2.00 | 8.00 | 1.00 | 8.00 | 1.00 | 4999.00 | 22500.00 |
| near bay | max | 52.00 | 18634.00 | 3226.00 | 8276.00 | 3052.00 | 150001.00 | 500001.00 |
| near bay | sd | 13.07 | 1823.58 | 367.89 | 876.10 | 344.70 | 20211.20 | 122853.74 |
| near bay | median | 39.00 | 2082.50 | 423.00 | 1033.50 | 404.50 | 38186.50 | 233800.00 |
| near ocean | mean | 29.31 | 2587.17 | 538.62 | 1355.64 | 501.53 | 40063.74 | 249042.36 |
| near ocean | min | 2.00 | 15.00 | 3.00 | 8.00 | 3.00 | 5360.00 | 22500.00 |
| near ocean | max | 52.00 | 30405.00 | 4585.00 | 12873.00 | 4176.00 | 150001.00 | 500001.00 |
| near ocean | sd | 11.84 | 1998.04 | 376.32 | 1008.13 | 345.14 | 20161.57 | 122548.01 |
| near ocean | median | 29.00 | 2197.00 | 464.00 | 1137.50 | 429.00 | 36483.00 | 228750.00 |
Machine Learning Models
Our machine learning models use only the quantitative housing variables; longitude, latitude, and the categorical ocean_proximity are excluded.
# splitting data into training and testing set
training.index <- sample(1:nrow(housing), nrow(housing)/2);
training <- housing[training.index, ] %>% select(!c(longitude, latitude, ocean_proximity));
kable(head(training, 10), caption="Training set (first 10 observations)")

| housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|
| 2 | 838 | 295 | 240 | 149 | 28750 | 237500 |
| 25 | 1709 | 439 | 632 | 292 | 17868 | 45500 |
| 15 | 2150 | 327 | 1094 | 324 | 60224 | 198500 |
| 22 | 3479 | 455 | 1454 | 488 | 66324 | 347600 |
| 4 | 4791 | 695 | 1871 | 659 | 69532 | 277000 |
| 28 | 859 | 199 | 455 | 211 | 23293 | 215900 |
| 16 | 2867 | 559 | 1203 | 449 | 27143 | 95300 |
| 19 | 2164 | 410 | 1309 | 426 | 33380 | 185300 |
| 25 | 3445 | 898 | 5558 | 894 | 30972 | 169300 |
| 41 | 2475 | 532 | 1416 | 470 | 38372 | 156400 |
testing <- housing[-training.index, ] %>% select(!c(longitude, latitude, ocean_proximity));
kable(head(testing, 10), caption="Testing set (first 10 observations)")

| housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|
| 41 | 880 | 129 | 322 | 126 | 83252 | 452600 |
| 21 | 7099 | 1106 | 2401 | 1138 | 83014 | 358500 |
| 52 | 1467 | 190 | 496 | 177 | 72574 | 352100 |
| 42 | 2555 | 665 | 1206 | 595 | 20804 | 226700 |
| 52 | 3549 | 707 | 1551 | 714 | 36912 | 261100 |
| 52 | 2491 | 474 | 1098 | 468 | 30750 | 213500 |
| 52 | 696 | 191 | 345 | 174 | 26736 | 191300 |
| 52 | 1966 | 347 | 793 | 331 | 27750 | 152500 |
| 52 | 1228 | 293 | 648 | 303 | 21202 | 155500 |
| 50 | 2239 | 455 | 990 | 419 | 19911 | 158700 |
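Note that the sample() call above is not seeded, so the partition changes on every run. A minimal reproducible variant is sketched below (the seed value and the hard-coded observation count are illustrative):

```r
# Reproducible 50/50 split: fixing the RNG seed makes the
# training/testing partition identical across runs
set.seed(10)                      # any fixed seed works; 10 is arbitrary
n <- 20433                        # observations after removing NAs
training.index <- sample(1:n, n %/% 2)
testing.index  <- setdiff(1:n, training.index)
length(training.index)            # 10216 training observations
length(testing.index)             # 10217 testing observations
```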
Regularization - Ridge Regression
training.x = model.matrix(median_house_value~. , training)[ , -1];
training.y = training$median_house_value;
testing.x = model.matrix(median_house_value~. , testing)[ , -1];
testing.y = testing$median_house_value;
set.seed(10);
# cross-validation to choose lambda
cv.errors <- cv.glmnet(training.x, training.y, alpha=0);
plot(cv.errors);

By cross-validation, the best lambda for ridge regression is \(\lambda\) = 7987.5455678.
ridge.model <- glmnet(training.x, training.y, alpha=0, lambda = cv.errors$lambda.min);
# coefficients of the ridge regression model (fit on the training set)
coef(ridge.model);

## 7 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -14991.261587
## housing_median_age 1739.053791
## total_rooms -5.017248
## total_bedrooms 54.769915
## population -27.278244
## households 69.246459
## median_income 4.121757
ridge.predict <- predict(ridge.model, s=cv.errors$lambda.min, newx=testing.x);
ridge.test.error <- mean((ridge.predict - testing.y)^2);

Our estimated ridge regression model is:

median_house_value = -14991.26 + 1739.05·housing_median_age - 5.02·total_rooms + 54.77·total_bedrooms - 27.28·population + 69.25·households + 4.12·median_income

with a test error of MSE = 6059454667.62.
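The glmnet call above solves the ridge problem numerically; for intuition, ridge regression has the closed form \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\), which can be computed directly in base R on synthetic data. This is a sketch only: glmnet additionally standardizes predictors and scales the penalty, so its numbers differ from a raw closed-form fit.

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0.5)
y <- X %*% beta + rnorm(n)

# closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

ridge_coef(X, y, lambda = 0)    # lambda = 0 recovers ordinary least squares
ridge_coef(X, y, lambda = 100)  # a large lambda shrinks coefficients toward 0
```

Increasing lambda always decreases the \(\ell_2\) norm of the coefficient vector, which is the bias-variance trade-off that cross-validation tunes above.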
Generalized Additive Models
# applied to the training set
gam1 = gam(median_house_value ~ s(housing_median_age) +
s(total_rooms) + s(total_bedrooms) + s(population) +
s(households) + s(median_income), data=training);
par(mfrow=c(2, 3));
plot(gam1, se=TRUE, col="blue");

# applied to the testing set
gam2 = gam(median_house_value ~ s(housing_median_age) +
s(total_rooms) + s(total_bedrooms) + s(population) +
s(households) + s(median_income), data=testing);
# summary of gam model on test data
summary(gam2);

##
## Call: gam(formula = median_house_value ~ s(housing_median_age) + s(total_rooms) +
## s(total_bedrooms) + s(population) + s(households) + s(median_income),
## data = testing)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -463720 -45086 -10384 31970 543673
##
## (Dispersion Parameter for gaussian family taken to be 5098901994)
##
## Null Deviance: 136757485535296 on 10216 degrees of freedom
## Residual Deviance: 51968007544944 on 10192 degrees of freedom
## AIC: 257394.9
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value
## s(housing_median_age) 1 1802251875454 1802251875454 353.46
## s(total_rooms) 1 5211746561959 5211746561959 1022.13
## s(total_bedrooms) 1 3876069017624 3876069017624 760.18
## s(population) 1 6908955381469 6908955381469 1354.99
## s(households) 1 4271217680826 4271217680826 837.67
## s(median_income) 1 64968481315969 64968481315969 12741.66
## Residuals 10192 51968007544944 5098901994
## Pr(>F)
## s(housing_median_age) < 0.00000000000000022 ***
## s(total_rooms) < 0.00000000000000022 ***
## s(total_bedrooms) < 0.00000000000000022 ***
## s(population) < 0.00000000000000022 ***
## s(households) < 0.00000000000000022 ***
## s(median_income) < 0.00000000000000022 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(housing_median_age) 3 18.44 0.000000000006345 ***
## s(total_rooms) 3 310.75 < 0.00000000000000022 ***
## s(total_bedrooms) 3 378.18 < 0.00000000000000022 ***
## s(population) 3 223.76 < 0.00000000000000022 ***
## s(households) 3 151.95 < 0.00000000000000022 ***
## s(median_income) 3 227.99 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1, 1));
pred <- predict(gam2, testing);
# test errors for the predictor variables
housing_median_age_MSE <- mean((testing$housing_median_age-pred)^2);
total_rooms_MSE <- mean((testing$total_rooms-pred)^2);
total_bedrooms_MSE <- mean((testing$total_bedrooms-pred)^2);
population_MSE <- mean((testing$population-pred)^2);
households_MSE <- mean((testing$households-pred)^2);
median_income_MSE <- mean((testing$median_income-pred)^2);
# Test Errors
gam.test.error <- as.data.frame(rbind(housing_median_age_MSE, total_rooms_MSE, total_bedrooms_MSE,
population_MSE, households_MSE, median_income_MSE));
row.names(gam.test.error) <- names(housing[3:8]);
names(gam.test.error) <- "MSE";

Our estimated generalized additive model is:
median_house_value = -67960.6 + 1944.04·s(housing_median_age) - 29.31·s(total_rooms) + 150.42·s(total_bedrooms) - 35.78·s(population) + 129.44·s(households) + 5.22·s(median_income)
| | MSE |
|---|---|
| housing_median_age | 51089443027 |
| total_rooms | 49934943084 |
| total_bedrooms | 50869865093 |
| population | 50511724563 |
| households | 50885338294 |
| median_income | 33914834260 |
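As a self-contained illustration of what the s() smooths are doing, here is a minimal GAM fit on synthetic data using the mgcv package (a different package from the gam package used above, so the smooth details differ; the data and formula are hypothetical):

```r
library(mgcv)  # ships with standard R installations as a recommended package

set.seed(10)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)  # smooth nonlinear signal plus noise

fit <- mgcv::gam(y ~ s(x))   # s(x): penalized thin-plate regression spline
mean((y - fitted(fit))^2)    # in-sample MSE of the smooth fit
```

The smooth adapts its flexibility to the data via a penalty chosen automatically, which is why a GAM can capture the nonlinearity that a plain linear term would miss.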
Polynomial Regression
Based on the matrix plot above, the variable with the highest correlation with median_house_value is median_income with a correlation coefficient of 0.6883555.
# scatterplot median_house_value vs median_income
ggplotly(ggplot(housing) +
geom_point(aes(median_income, median_house_value), size=1, col="dodgerblue") +
           labs(x="Median income", y="Median house value", title="Scatterplot of median house value vs median income"));

From the scatterplot above, we can clearly see a positive association between median_income and median_house_value.
# cross-validation to choose the optimal polynomial degree
# (LOOCV via cv.glm on a 5000-observation subsample to keep runtime manageable;
# the same subsample is used for fitting and cross-validation)
cv.error <- rep(0, 10);
housing.sub <- housing[sample(1:nrow(housing), 5000), ];
for (i in 1:10) {
  glm.fit <- glm(median_house_value ~ poly(median_income, i), data=housing.sub);
  cv.error[i] <- cv.glm(housing.sub, glm.fit)$delta[1];
}
# plot of test errors by degree
ggplotly(
ggplot() +
geom_point(aes(1:length(cv.error), cv.error), col="blue", size=2) +
geom_line(aes(1:length(cv.error), cv.error), size=0.5) +
labs(x="Polynomial degree", y="Test error (MSE)", title="Test error (MSE) by polynomial degree") +
theme_bw() +
    scale_x_continuous(breaks=1:10));

## poly.degree MSE
## 1 1 19522304422
## 2 2 19423797649
## 3 3 19430019567
## 4 4 20399643152
## 5 5 19236628607
## 6 6 20195506476
## 7 7 19334874831
## 8 8 19962243574
## 9 9 19680937228
## 10 10 19395387618
optimal.degree <- which(cv.error==min(cv.error));
# polynomial regression model
median_incomelims = range(housing$median_income);
median_income.grid=seq(from=median_incomelims[1], to=median_incomelims[2]);
poly.fit <- lm(median_house_value ~ poly(median_income, optimal.degree), data=housing);
pred = predict(poly.fit, newdata=list(median_income=median_income.grid), se=T);
se.band = cbind(pred$fit + 2*pred$se.fit, pred$fit - 2*pred$se.fit);
# summary of the polynomial regression model
summary(poly.fit);

##
## Call:
## lm(formula = median_house_value ~ poly(median_income, optimal.degree),
## data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -392860 -55972 -16451 35739 413140
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 206864.4 578.6 357.513
## poly(median_income, optimal.degree)1 11358166.4 82710.4 137.325
## poly(median_income, optimal.degree)2 -1110193.0 82710.4 -13.423
## poly(median_income, optimal.degree)3 -1498110.7 82710.4 -18.113
## poly(median_income, optimal.degree)4 165372.8 82710.4 1.999
## poly(median_income, optimal.degree)5 105491.6 82710.4 1.275
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## poly(median_income, optimal.degree)1 <0.0000000000000002 ***
## poly(median_income, optimal.degree)2 <0.0000000000000002 ***
## poly(median_income, optimal.degree)3 <0.0000000000000002 ***
## poly(median_income, optimal.degree)4 0.0456 *
## poly(median_income, optimal.degree)5 0.2022
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82710 on 20427 degrees of freedom
## Multiple R-squared: 0.4867, Adjusted R-squared: 0.4866
## F-statistic: 3874 on 5 and 20427 DF, p-value: < 0.00000000000000022
# scatter plot with optimal polynomial fit
ggplotly(ggplot(housing, aes(median_income, median_house_value)) +
geom_point(col="dark grey") +
           geom_smooth(method=lm, formula=y ~ poly(x, optimal.degree, raw=TRUE), se=F, col="blue") +
           labs(x="Median income", y="Median house value", title="Median house value vs median income with polynomial of optimal degree"));

The optimal degree for polynomial regression is 5.
Our estimated polynomial regression model (with coefficients on the orthogonal polynomial basis produced by poly()) is:

median_house_value = 206864.41 + 11358166.43·median_income - 1110192.98·median_income^2 - 1498110.74·median_income^3 + 165372.78·median_income^4 + 105491.6·median_income^5

with a test error of MSE = 19236628606.76.
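The degree selection above relies on boot::cv.glm; the same idea can be sketched with a hand-rolled k-fold loop in base R on synthetic data where the true relationship is cubic (an illustrative example, not the housing fit):

```r
set.seed(10)
n <- 500
x <- runif(n, -2, 2)
y <- 1 + 2 * x - x^3 + rnorm(n)   # true signal is a degree-3 polynomial
dat <- data.frame(x = x, y = y)

# 5-fold cross-validation over candidate degrees 1..6
k <- 5
folds <- sample(rep(1:k, length.out = n))
cv_mse <- sapply(1:6, function(d) {
  mean(sapply(1:k, function(f) {
    fit <- lm(y ~ poly(x, d), data = dat[folds != f, ])
    mean((dat$y[folds == f] - predict(fit, dat[folds == f, ]))^2)
  }))
})
which.min(cv_mse)  # degrees below 3 underfit, so the winner is at least 3
```

Degrees 1 and 2 cannot represent the cubic term and carry large bias, while degrees above 3 add variance without reducing bias much; cross-validation balances the two.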
Performance on test data compared across models
As we saw previously,

- The ridge regression model had a test error of MSE = 6059454667.62.
- The generalized additive model had a test error for each predictor variable; the smallest was for median_income, at MSE = 33914834259.64.
- The polynomial regression model had a test error of MSE = 19236628606.76.
Conclusion
Based on the test errors for each machine learning method, we conclude that the ridge regression model is the best of the three at predicting median_house_value for California blocks around 1990, since it has the smallest test error, MSE = 6059454667.62.
The ridge regression model pertaining to our data is:
median_house_value = -14991.26 + 1739.05·housing_median_age - 5.02·total_rooms + 54.77·total_bedrooms - 27.28·population + 69.25·households + 4.12·median_income